
npj Digital Medicine

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match npj Digital Medicine's content profile, based on 97 papers previously published here. The average preprint has a 0.21% match score for this journal, so anything above that is already an above-average fit.

1
Patient-Centric Markov-Chain Framework for Predicting Medication Adherence Using De-Identified Data

Dantuluri, A. V. S. R.

2026-02-10 health informatics 10.64898/2026.02.08.26345856 medRxiv
Top 0.1%
66.5%

Long-term adherence to prescribed therapies remains a persistent challenge in chronic and ultra-rare conditions where clinical outcomes depend on continuous medication use. Even brief gaps in therapy can compromise disease control, yet patients frequently encounter structural barriers including high out-of-pocket costs, prior-authorization (PA) delays, annual re-verification cycles, and refill logistics that disrupt persistence. This study evaluates a patient-centric Markov-chain framework for adherence risk stratification trained on eight years of de-identified specialty-pharmacy data representing 1,200 active patients. Certified data aggregators supply longitudinally linkable, tokenized data to preserve privacy while enabling multi-year adherence trajectory modeling. Transition probabilities between fully adherent, partially adherent, and lapsed states are estimated and adjusted using covariates such as age, duration on therapy, refill cadence, PA processing time, copay burden, and foundation-assistance status. The model achieves an accuracy of 0.82, an F1-score of 0.79, and an AUC of 0.87, with 95% confidence intervals estimated via bootstrapping across cross-validated folds. Results highlight cost exposure, administrative friction, and mid-treatment duration (1-5 years) as dominant predictors of future non-adherence. Findings demonstrate how probabilistic modeling of privacy-preserved real-world data can support equitable patient-assistance strategies, identifying individuals vulnerable to systemic barriers rather than emphasizing commercial performance metrics.
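
The core estimation step this abstract describes, counting transitions between adherence states and normalizing them into a Markov transition matrix, can be sketched as follows. The state labels and toy refill histories are invented, and the paper's covariate adjustments are omitted:

```python
# Simplified, covariate-free sketch of Markov transition-probability
# estimation over adherence states. All data here is illustrative.
STATES = ["adherent", "partial", "lapsed"]
IDX = {s: i for i, s in enumerate(STATES)}

def estimate_transition_matrix(sequences):
    """Count observed state-to-state transitions, then normalize each row."""
    n = len(STATES)
    counts = [[0.0] * n for _ in range(n)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[IDX[a]][IDX[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

toy = [["adherent", "adherent", "partial", "lapsed"],
       ["adherent", "partial", "adherent"]]
P = estimate_transition_matrix(toy)
```

In the full framework, each row would additionally be adjusted by patient covariates (age, copay burden, PA processing time, and so on) rather than estimated from raw counts alone.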

2
Population differences in wearable device wear time: Rescuing data to address biases and advance health equity

Hurwitz, E.; Connelly, E.; Sklerov, M.; Master, H.; Hochheiser, H.; Butzin-Dozier, Z.; Dunn, J.; Haendel, M. A.

2026-03-06 health informatics 10.64898/2026.03.06.26347799 medRxiv
Top 0.1%
65.9%

Wearable devices present transformative opportunities for personalized healthcare through continuous monitoring of digital biomarkers; however, individual variations in device wear time could mask or otherwise impact signal identification. Despite the widespread adoption of wearable devices in research, no comprehensive framework exists for understanding how wear time varies across populations or for addressing wear time-related biases in analysis. Using Fitbit data from 11,901 participants in the All of Us Research Program, we conducted the first large-scale systematic assessment of wearable device wear time across demographics, social determinants of health, lifestyle factors, mental health symptoms, and disease. Our findings revealed that wear time was higher among males and increased with age, income, and education, but decreased with depressive, anxiety, and anhedonia symptoms, with reductions more pronounced following clinical diagnoses compared to symptom-based classifications. Individuals with chronic conditions displayed differential levels of wear time compared to healthy controls. Critically, we demonstrate that the widely used ≥10-hour daily compliance threshold, while appropriate for some research contexts, can disproportionately exclude days of data from disease populations: among individuals with major depressive disorder, 74.4% of data days were excluded compared to 20.9% for controls. We propose a flexible methodological framework including standard compliance thresholds, wear time covariate adjustment, metric normalization, propensity score matching, and adaptive thresholds that can be applied individually or in combination to optimize wearable data retention across diverse research contexts. These findings establish wear time as a critical methodological consideration for wearable device research and provide guidance for advancing equitable and rigorous digital health analytics.
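
The ≥10-hour compliance threshold the abstract critiques can be illustrated with a minimal sketch; the wear-hour values below are invented to mirror the reported asymmetry, not drawn from the study:

```python
# Sketch of a daily wear-time compliance rule: days below the threshold
# are dropped from analysis. Wear-hour values are illustrative only.
THRESHOLD_HOURS = 10

def retained_days(daily_wear_hours, threshold=THRESHOLD_HOURS):
    """Return the fraction of days meeting the wear-time threshold."""
    if not daily_wear_hours:
        return 0.0
    kept = [h for h in daily_wear_hours if h >= threshold]
    return len(kept) / len(daily_wear_hours)

# A participant who wears the device less (e.g., during depressive
# episodes) loses far more data under the same fixed threshold.
control = [14, 15, 12, 13, 11, 16, 14]
lower_wear = [6, 4, 11, 3, 12, 5, 2]
```

This asymmetric data loss is exactly why the authors propose alternatives such as covariate adjustment and adaptive thresholds.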

3
AI-Driven Zero-Touch Network Orchestration for Tele-Radiology in Resource-Constrained Environments

Javed, M. Z.; Majeed, R.; Shafeeq, U.; Usman, H.; Ahmad, M.

2026-02-16 medical education 10.64898/2026.02.13.26346260 medRxiv
Top 0.1%
64.0%

Background: The deployment of high-fidelity diagnostic Artificial Intelligence (AI) in resource-constrained environments is hindered by the stochastic nature of network latency and bandwidth limitations. Traditional tele-radiology relies on static cloud offloading, which introduces unacceptable latency for critical care scenarios. Zero-Touch Network and Service Management (ZSM) offers a paradigm for automated network orchestration, yet current frameworks lack application-layer awareness regarding clinical urgency and image complexity. Methodology: This study proposes a novel Cross-Modal Latent Transformer (CMLT) integrated within a Zero-Touch Network Orchestration architecture. The system utilizes a lightweight Edge-Gating mechanism to dynamically partition inference tasks between edge nodes and cloud resources based on feature entropy. The model was trained and validated on the MIMIC-CXR (v2.0.0) (n = 377,110) and CheXpert (n = 224,316) datasets, employing a 70/10/20 split. Results: The proposed orchestration framework achieved an AUC-ROC of 0.962 [95% CI: 0.941-0.983] for Atelectasis detection, comparable to full-cloud inference, while reducing network bandwidth consumption by 64.3%. McNemar's test indicated no statistically significant difference in diagnostic accuracy between the orchestrated hybrid approach and the full-precision cloud baseline (p > 0.05), despite a 120 ms reduction in mean inference latency. Clinical Significance: By embedding clinical feature extraction directly into the network orchestration logic, this framework enables real-time, zero-touch provisioning of diagnostic resources, facilitating reliable AI deployment in rural and bandwidth-limited clinical settings.
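
The entropy-based edge/cloud gating idea could look roughly like this; the entropy threshold and probability vectors are illustrative assumptions, not the paper's CMLT implementation:

```python
import math

# Sketch of entropy-gated task partitioning: confident (low-entropy)
# edge predictions are finalized locally; uncertain cases are offloaded
# to the cloud. Threshold and inputs are invented for illustration.
def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def route(edge_probs, threshold_bits=0.5):
    """Return 'edge' when the edge model is confident, else 'cloud'."""
    return "edge" if shannon_entropy(edge_probs) <= threshold_bits else "cloud"
```

Routing only high-entropy cases to the cloud is what lets such a system cut bandwidth while keeping accuracy comparable to full-cloud inference.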

4
Developing and Testing an Engineering Framework for Curiosity-Driven and Humble AI in Clinical Decision Support

Arslan, J.; Benke, K.; Cajas, S.; Castro, R.; Celi, L. A.; Cruz Suarez, G. A.; Delos Reyes, R.; Engelmann, J.; Ercole, A.; Hilel, A.; Kalla, M.; Kinyera, L.; Lange, M.; Lunde, T. M.; Meni, M. J.; Ocampo Osorio, F.; Premo, A.; Sedlakova, J.; Vig, P.

2026-02-07 health informatics 10.64898/2026.02.06.26345664 medRxiv
Top 0.1%
63.7%

Background: We present BODHI (Balanced, Open-minded, Diagnostic, Humble, and Inquisitive), an engineering framework for curiosity-driven and humble clinical decision support AI. Despite growing capabilities, large language models (LLMs) often express inappropriate confidence, conflating statistical pattern recognition with genuine medical understanding. BODHI addresses this through a dual-reflective architecture that: (1) decomposes epistemic uncertainty into task-specific dimensions, and (2) constrains model responses using virtue-based stance rules derived from a Virtue Activation Matrix. Methods: We validate the framework through controlled evaluation on 200 clinical vignettes from HealthBench Hard, assessing GPT-4o-mini and GPT-4.1-mini across 5 random seeds (1,800 total observations). Statistical analysis included bootstrap resampling, paired t-tests, and effect size computation (Supplementary Materials S3). Findings: BODHI significantly improved overall clinical response quality (GPT-4.1-mini: +17.3pp, p < 0.0001, Cohen's d = 0.50; GPT-4o-mini: +7.4pp, p < 0.0001, Cohen's d = 0.22) while achieving very large effect sizes on curiosity (context-seeking rate: Cohen's d = 16.38 and 19.54) and humility (hedging: d = 5.80 for GPT-4.1-mini) metrics. Crucially, 97.3% of GPT-4.1-mini responses and 73.5% of GPT-4o-mini responses included appropriate clarifying questions, compared to 7.8% and 0.0% at baseline, demonstrating the framework's effectiveness in eliciting information-gathering behavior. Interpretation: These findings suggest LLMs can be reliably constrained to operate within epistemic boundaries when provided with structured uncertainty decomposition and virtue-aligned response rules, offering a pathway toward safer clinical AI deployment.
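
The percentile-bootstrap confidence interval named in the Methods can be sketched as follows; the paired per-vignette score differences are invented:

```python
import random

# Sketch of a percentile bootstrap CI for the mean of paired score
# differences (treatment minus baseline). Data values are invented.
def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Resample with replacement; return the (alpha/2, 1-alpha/2) bounds."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(diffs) for _ in diffs) / len(diffs)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

diffs = [0.10, 0.25, 0.05, 0.18, 0.12, 0.20, 0.08, 0.15]
lo, hi = bootstrap_ci(diffs)
```

An interval that excludes zero (as here, with all differences positive) is the bootstrap analogue of the significant paired improvements the abstract reports.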

5
Learning Patient-Specific Event Sequence Representations for Clinical Process Analysis

Solyomvari, K.; Antikainen, T.; Moen, H.; Marttinen, P.; Renkonen, R.; Koskinen, M.

2026-03-30 health informatics 10.64898/2026.03.25.26348333 medRxiv
Top 0.1%
61.3%

Healthcare system performance evaluation is constrained by episodic performance indicators and process mining techniques that fail to accommodate the scale, heterogeneity, and temporal complexity of real-world clinical pathways. Electronic health records enable reconstructing patient journeys that capture how care processes unfold across fragmented healthcare services. Here we present ClinicalTAAT, a time-aware transformer that bridges clinical sequence modeling and process mining by integrating contextual and time-varying information to learn interpretable patient-specific representations from inherently sparse, irregular, and high-dimensional clinical event sequences. Evaluated on a large pediatric emergency cohort, ClinicalTAAT outperforms existing models in acuity and diagnosis classification, identifies clinically meaningful patient subgroups in a heterogeneous population with distinct acuity, resource utilization, and diagnostic patterns, and detects anomalies in individual care trajectories. These findings demonstrate that time-aware transformers can complement existing process mining methodologies and serve as foundation models for clinical process analysis, providing a scalable framework for data-driven healthcare evaluation and optimization.

6
Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities

Sehgal, N. K. R.; Tronieri, J. S.; Ungar, L.; Guntuku, S. C.

2026-03-13 health informatics 10.64898/2026.03.12.26348253 medRxiv
Top 0.1%
58.4%

Social media can reveal patient experiences with glucagon-like peptide-1 receptor agonists (GLP-1 RAs) that extend beyond clinical trial data. We analyzed 410,198 Reddit posts (May 2019-June 2025) mentioning semaglutide or tirzepatide. A total of 67,008 users self-reported using these medications, and 43.5% described at least one side effect. Gastrointestinal symptoms predominated, including nausea (36.9%), fatigue (16.7%), vomiting (16.3%), constipation (15.3%), and diarrhea (12.6%). Notably, reproductive symptoms (e.g., menstrual irregularities) and temperature-related complaints (e.g., chills, hot flashes) emerged as unrecognized potential effects. These findings highlight patient concerns not well captured in current labeling or trials. Large-scale social media analysis can complement traditional pharmacovigilance by detecting emerging safety signals and expanding understanding of the real-world safety profile of GLP-1 RAs.

7
Cognitive AI-Assisted Primary Care Health Delivery: A Pilot Study in Bangladesh

Kabir, R. A.; Williams, M.; Rayhan, N.

2026-04-05 public and global health 10.64898/2026.04.03.26349253 medRxiv
Top 0.1%
58.2%

Research has documented persistent physician workforce shortages globally, with projected shortfalls threatening primary care access in underserved populations. Existing AI applications in healthcare have largely focused on predictive risk-scoring tools that generate probability estimates but do not reduce the time a physician spends completing a patient encounter. A January 2025 study further demonstrated that large language models lack the metacognitive capacity necessary for reliable medical reasoning, i.e., being able to ask appropriate questions in the absence of information to collect patient history and update differential diagnoses. This paper reports on a 2025 pilot deployment of ClinicalAssist in Bangladesh that tested a fundamentally different model: an AI system designed to replicate every step of the clinical workflow. Across 239 unique patients, 277 encounters, and 287 diagnostic opportunities, the system achieved an overall diagnostic accuracy of 94.7%, with chronic disease accuracy of 98.0% and acute care accuracy of 88.9%. These results suggest that cognitive AI has the potential to be a powerful clinical force multiplier if properly integrated in workflow.

8
From Concept to Clinic: Real World Evidence for Autonomous AI Deployment in Primary Care Telemedicine

Saenz, A. D.; Schumacher, E.; Naik, D.; Khosla, N.; Kannan, A.

2026-03-20 health informatics 10.64898/2026.03.18.26348749 medRxiv
Top 0.1%
58.1%

Systems powered by large language models are widely used for health information and advice, yet robust evidence for their safety and effectiveness in real-world clinical care remains lacking. Most existing studies evaluate general-purpose chatbots in artificial settings, failing to account for the critical role of system design, deployment context, and integrated safety mechanisms. Here, we report, to our knowledge, the first large-scale, clinician-blinded, real-world evaluation of a multi-agent LLM-based system deployed within a nationwide U.S. primary care telemedicine platform, assessing readiness for task-specific autonomous deployment. In 2,379 real patient encounters, where users actively sought medical care and completed full visits with licensed clinicians, we compared the AI system's intake diagnoses and disposition suggestions to those of treating clinicians, who were blinded to the AI's outputs. The AI's top-1 diagnosis matched the clinician's diagnosis in 91.3% of cases overall, increasing to 96.3% among cases meeting a pre-specified safety confidence threshold, and 97.9% in common, lower-complexity conditions that met the same confidence threshold. Disposition accuracy was similarly high, with an overall error rate of 2.5% and no errors in suggestions to emergency room or home management. These results demonstrate that purposeful system architecture, rather than model capability alone, is essential for safe and effective autonomous clinical AI. We propose a staged, task-calibrated deployment framework, in which AI can be introduced autonomously for well-defined tasks with explicit safety gating and continuous monitoring, expanding scope as real-world evidence accrues. Our findings provide the first real-world evidence of readiness for safe autonomous clinical AI and offer a practical roadmap for its responsible deployment at scale.

9
Digital journaling enables privacy-preserving behavioral phenotyping and real-time risk monitoring at scale

Milham, M.; Low, D.; Erkent, A.; Trabulsi, J.; Kass, M. C.; Vos de Wael, R.; Yenepalli, S.; Wang, Y.; Leyden, M.; Jordan, C.; Salum, G.; Alexander, L.; Schubiner, G.; Hendrix, L.; Koyama, M.; Mears, L.; McAdams, R.; White, C.; Merikangas, K.; Satterthwaite, T. D.; Franco, A.; Klein, A.; Koplewicz, H.; Leventhal, B.; Freund, M.; Kiar, G.

2026-04-08 psychiatry and clinical psychology 10.64898/2026.04.04.26349881 medRxiv
Top 0.1%
52.3%

Digital mental health applications enable high-frequency behavioral monitoring and scalable interventions. Journaling provides a therapeutically grounded and intrinsically engaging activity for many users. AI-based text analysis enables privacy-preserving phenotyping of clinically relevant patterns in naturalistic writing, including emotional distress and behavioral risk (e.g., indicators of intent, planning, or preparatory actions for harm to self or others). We evaluated a mobile journaling platform in an 8-week randomized controlled trial (N = 507) of young adults with mild-to-moderate anxiety and depression symptoms. Journaling produced modest reductions in anxiety relative to controls at the 8-week endpoint and 1-month follow-up (d = 0.16-0.19). Effects were small and did not remain significant after correction for multiple comparisons; complementary Bayesian models nonetheless provided moderate-to-strong directional evidence (90-97%) supporting a modest anxiety reduction. In parallel, behavioral phenotyping analyses showed that high-risk journal entries were more common among younger users (OR = 0.77 per year of age, p = 0.007). Text-based risk signals and self-reported energy exhibited significant circadian variation (e.g., risk probability was highest during late-night and overnight hours). Within-person analyses demonstrated strong short-term persistence in mood and risk states, with calm/relaxed showing the highest persistence and anxious/agitated exhibiting the lowest persistence. High-risk journal entries clustered temporally and were preceded by sustained low valence and energy. Although affective volatility was associated with acute declines within the same affective dimension (pleasantness or energy), it was not associated with escalation to high-risk states. Key behavioral dynamics observed in the trial were replicated in an independent general population dataset (N = 16,630). 
Collectively, these findings demonstrate that privacy-preserving digital journaling can support scalable longitudinal behavioral phenotyping and real-time risk monitoring while providing modest clinical benefit for anxiety symptoms.

10
Personalized Insights Derived from Wearable Device Data and Large Language Models to Improve Well-Being

He, K.; Fang, Y.; Frank, E.; Li, C.; Bohnert, A.; Sen, S.; Wang, M.

2026-03-04 health informatics 10.64898/2026.03.03.26347299 medRxiv
Top 0.1%
51.3%

Health behaviors such as physical activity and sleep affect mental health, but the effect of each health behavior varies substantially across individuals, limiting the usefulness of generic behavioral recommendations. We collected one year of continuous wearable and ecological momentary assessment data from 3,139 participants in the Intern Health Study (2018-2023), and examined individual-level associations between wearable-derived features and mood across the internship year. The behaviors associated with mood were highly heterogeneous between individuals: the two most prevalent drivers of mood were wake-up time (the strongest driver for 34.0% of subjects) and step count (10.6% of subjects). The correlation directionality remained largely stable despite fluctuations in strength. Interestingly, 20.3% of subjects showed no significant correlations. These findings highlight the limitations of population-level recommendations and the critical need for personalized, data-driven approaches to mental health assessment and intervention. To translate these personalized insights into actionable support, we developed MoodDriver, a large language model (LLM)-powered system that generates tailored feedback emails based on each participant's behavioral and physiological patterns. This work demonstrates the feasibility of combining digital phenotyping with large language models to advance precision digital mental health for high-risk populations.

11
Artificial Intelligence for Automated, Highly Accurate, and Scalable Multimodal EHR Data Abstraction

Margaritis, G.; Petridis, P.; Bertsimas, D.; Bloom, J.; Hagberg, R.; Habib, R.; Shahian, D. M.; Orfanoudaki, A.

2026-03-17 health informatics 10.64898/2026.03.16.26348522 medRxiv
Top 0.1%
49.7%

Electronic health records (EHRs) contain rich multimodal data but remain underutilized for populating clinical registries due to the time and cost of manual abstraction. We developed an AI-driven pipeline to automate data abstraction for variables in the Society of Thoracic Surgeons Adult Cardiac Surgery Database (ACSD). Models were developed using Mass General Brigham data and externally validated on Hartford HealthCare data. The pipeline processes ten clinical EHR sources (seven unstructured text types and three structured data types), each encoded using two language-model embeddings and term frequency-inverse document frequency. This approach yielded 30 source-specific models per target variable whose predictions were aggregated by an ensemble meta-learner, followed by a dual-threshold confidence framework that enforced registry-grade high accuracy standards and deferred uncertain predictions to human review. The developed pipeline achieved an overall accuracy exceeding 99% across 647 registry variables, while automatically completing 49.5% and 43.2% of variables at the two sites, respectively. These results demonstrate that AI-assisted abstraction can substantially reduce clinical registry data collection burden while maintaining high accuracy.
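
A dual-threshold confidence framework of the kind described, auto-accepting predictions only at extreme confidence and deferring everything else to human review, might be sketched like this; the thresholds are illustrative assumptions, not the paper's calibrated values:

```python
# Sketch of dual-threshold deferral for a binary registry variable:
# predictions are auto-filled only when the model's probability clears
# a high (or low) bar. Thresholds below are invented for illustration.
def triage(prob_positive, accept_hi=0.98, accept_lo=0.02):
    """Return an automatic label at extreme confidence, else defer."""
    if prob_positive >= accept_hi:
        return "yes"
    if prob_positive <= accept_lo:
        return "no"
    return "defer_to_human"
```

Tightening the two thresholds trades automation rate for accuracy, which is how a pipeline can hold >99% accuracy while auto-completing only about half of the variables.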

12
Socially Grounded Exemplars Improve Synthetic Conversations for Health-Related Social Needs Navigation

Hussain, S.-A.; Jackson, D. I.; Thotapalli, S.; McClellan, M. B.; Stanco, M.; Varney, G.; Gleeson, S.; Nugroho, F.; Leever, W.; Fosler-Lussier, E.; Sezgin, E.

2026-02-02 health informatics 10.64898/2026.01.30.26345239 medRxiv
Top 0.1%
46.7%

Health-Related Social Needs (HRSNs) significantly impact health outcomes, yet traditional care often fails to address them effectively. While conversational agents offer scalable support, their deployment is hindered by privacy risks and a lack of specialized training data for clinical applications. Synthetic data generation offers a solution to address this gap; standard pipelines often prompt LLMs using structured user personas, comprising demographics, constraints, and goals, to emulate dialogues. However, current methods relying on coarse demographic attributes often yield generic or stereotyped personas that lack real-world nuance. To improve the realism of synthetic data, we introduce Socially Grounded Exemplars (SGEs), which translate abstract persona attributes into granular, conversational descriptors. We implemented a two-stage pipeline using GPT-4o to generate SGEs, which then grounded synthetic dialogue generation under various prompting strategies. We evaluated the approach using automatic diversity metrics (Vendi Score) and blinded pairwise preference ratings by community behavioral health specialists (CBHS). Validation confirmed the feasibility of input generation, with GPT-4o achieving an 85% term acceptability rate for SGEs. In conversation generation, dynamic SGEs significantly improved lexical diversity, achieving a Vendi Score of 289.41 compared to 252.36 for the control baseline. CBHS ranked the model combining dynamic SGEs with implicit name-based cueing highest (Bradley-Terry Score: 0.753), surpassing both the SGE-only model (0.663) and the explicit demographics model (0.348). Raters favored the name-augmented model for "Specificity & Natural Authenticity" (30.0%), while explicit demographic labeling reduced perceived authenticity. We show SGEs leverage LLM parametric knowledge to produce diverse synthetic data, surpassing the limitations of rigid demographic ontologies. 
Our findings indicate that implicit cueing through names yields more authentic representations than explicit labeling, reducing the risk of stereotyped outputs. This framework supports the creation of privacy-preserving, conversational datasets informing tasks (e.g. evaluation, agentic workflows, and model distillation) in sensitive healthcare contexts.
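
The Bradley-Terry scores used to rank models from pairwise preferences can be estimated with a standard minorization-maximization update; the win counts below are invented:

```python
# Sketch of Bradley-Terry strength estimation from pairwise preference
# counts (the ranking method the abstract reports). Win matrix invented.
def bradley_terry(wins, n_iter=200):
    """wins[i][j] = times model i was preferred over model j.
    Returns strengths normalized to sum to 1 (standard MM updates)."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        for i in range(n):
            num = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            if den:
                p[i] = num / den
        s = sum(p)
        p = [x / s for x in p]
    return p

wins = [[0, 8], [2, 0]]  # model 0 preferred 8 times, model 1 twice
p = bradley_terry(wins)
```

With 8 wins versus 2, the estimated strengths converge to roughly 0.8 and 0.2, i.e., the preference probability between the two models.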

13
Multinational Validation of the Intensive Documentation Index for ICU Mortality Prediction: Temporal Resolution and ICU Mortality

Collier, A.; Shalhout, S. Z.

2026-03-23 health informatics 10.64898/2026.03.19.26348852 medRxiv
Top 0.1%
43.1%

Clinical documentation timestamps generate a continuous, zero-burden behavioral signal in the electronic health record. We developed the Intensive Documentation Index (IDI) and validated it in two independent cohorts: MIMIC-IV (26,153 U.S. ICU heart failure patients, primary outcome in-hospital mortality) and HiRID (33,897 Swiss all-ICU patients, primary outcome ICU mortality). In MIMIC-IV, the IDI-enhanced logistic regression achieved an AUROC of 0.6491, compared with a baseline of 0.6242 (Brier score of 0.1299). In HiRID, where documentation latency is 1.2 minutes, compared with 15 hours in MIMIC-IV, AUROC was 0.9063, well above published APACHE IV and SAPS III benchmarks. The approximately 0.27 AUROC gap reflects the importance of temporal granularity in documentation-based risk stratification. IDI requires no physiologic measurements, making it complementary to established severity scores. Prospective validation in real-time EHR systems is required before clinical deployment.
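
The Brier score reported alongside AUROC is simply the mean squared error of probability forecasts against binary outcomes; a sketch with invented values:

```python
# Sketch of the Brier score: mean squared error between predicted
# probabilities and 0/1 outcomes. Lower is better; a constant 0.5
# forecast scores 0.25. The example values are illustrative only.
def brier_score(probs, outcomes):
    """Mean of (p - y)^2 over paired forecasts and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

Unlike AUROC, which measures ranking, the Brier score also penalizes miscalibrated probabilities, which is why the abstract reports both.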

14
Patient-Centred Communication in Lung Cancer Screening: A Clinically Focussed Evaluation of a Fine-Tuned Open-Source Model Against a Larger Frontier System

Khanna, S.; Chaudhary, R.; Narula, N.; Lee, R.

2026-04-11 oncology 10.64898/2026.04.10.26350595 medRxiv
Top 0.1%
43.1%

Lung cancer screening saves lives, yet uptake remains suboptimal and inequitable. Personalised communication can improve attendance and reduce anxiety, but scaling such support is a workforce challenge. We fine-tuned Google's Gemma 2 9B using QLoRA on 5,086 synthetic screening conversations and compared it against Google's Gemini 2.5 Flash (a larger frontier model) and an unmodified baseline across 300 multi-turn conversations with 100 patient personas spanning ten clinical categories. Evaluation combined automated natural language processing metrics with independent language model judgement in two complementary modes: structured clinical rubric and simulated patient persona. The fine-tuned model achieved the highest simulated patient experience score (3.71/5 vs 3.65 for the frontier model), recorded zero boundary violations after clinician review of all flagged instances, and led on the four most safety-critical categories. A composite Patient Adaptation Index showed that the fine-tuned model led overall (0.37 vs 0.35 vs 0.35), with its clearest advantage on the two clinically specific components: empathy calibration to patient distress and selective smoking cessation signposting. These findings suggest that targeted fine-tuning of open-source models can yield clinical communication quality comparable to larger proprietary systems, with advantages in safety-critical scenarios and suitability for NHS data governance constraints. Human clinician review of these conversations is ongoing.

15
A clinic-updated digital twin for Parkinson's disease progression: governed Bayesian forecasting with uncertainty-gated reporting

Hemedan, A. A.

2026-03-22 health informatics 10.64898/2026.03.19.26348807 medRxiv
Top 0.1%
42.8%

Background: Clinical digital twins hold considerable promise for forecasting disease progression, yet the question of when a model's outputs should be withheld remains largely unaddressed. A predictive model qualifies as a governed reporting system only when it specifies the operational boundaries under which its outputs are reliable and enforces criteria for suppressing results that fall outside those bounds. Methods: We present a governed Bayesian digital twin for multi-domain Parkinson's disease (PD) progression, tracking motor function (MDS-UPDRS Part III), cognition (Montreal Cognitive Assessment, MoCA), and autonomic function (SCOPA-AUT). A monotone latent state-space model captures disease progression under four architectural constraints: non-decreasing latent severity, visit-triggered updating, full posterior uncertainty propagation, and non-causal scope. A six-rule confidence gate evaluates each forecast before release; when evidence is insufficient, the gate suppresses the output and returns a structured reason code. We evaluated the framework on the Parkinson's Progression Markers Initiative (PPMI), a multicentre longitudinal observational study (N=4,628 participants; 28,185 visits), using five-fold cross-validation with independent model refits, equity analysis, and coupling-topology sensitivity assessment. The framework is available at https://gitlab.com/ahmed.hemedan/symphony-dt, with a research prototype at https://symphony-dt.com/. Results: Predictive interval coverage at the 95% level ranged from 94% to 96% across all three endpoints, compared with 64-69% for linear mixed-effects baselines. The confidence gate released governed forecasts at 32.7% of visits under strict three-domain requirements, increasing to 48.1% under a validated partial-observation extension. Suppression was predominantly driven by incomplete clinical assessment (51.5%) rather than model uncertainty (0.2%), and operated equitably across sexes (Cramér's V=0.049). 
Five of six cross-domain coupling parameters were identified from the data (sign probability ≥ 0.99; contraction ratios 0.19-0.35), with all cross-domain forecast correlations matching the directions predicted by the coupling topology. The framework's own diagnostics localised two observation-model limitations, prodromal motor heteroscedasticity and medication-burden sensitivity, to a single model layer and specified their resolution. Conclusions: Governed silence, defined as the rule-based suppression of predictions when reliability conditions are not met, can be embedded in clinical prediction architecture, quantified as a pipeline output, and audited for equity. This work demonstrates the technical executability of governed digital twin architecture at cohort scale and provides a foundation for prospective deployment under routine clinical conditions.
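
A rule-based confidence gate of the kind the abstract describes, suppressing a forecast and returning a structured reason code when reliability conditions fail, might look like this; the rule names, thresholds, and forecast schema are invented for illustration:

```python
# Sketch of a confidence gate: each rule either passes or suppresses the
# forecast with a reason code. Only two toy rules are shown; the paper's
# gate has six. Field names and thresholds are illustrative assumptions.
def gate(forecast):
    """Release the forecast only if every reliability rule passes."""
    rules = [
        ("INCOMPLETE_ASSESSMENT",
         lambda f: len(f["observed_domains"]) < 3),
        ("WIDE_INTERVAL",
         lambda f: f["ci_width"] > 20.0),
    ]
    for code, failed in rules:
        if failed(forecast):
            return {"released": False, "reason": code}
    return {"released": True, "value": forecast["point"]}

partial_visit = {"observed_domains": ["motor"], "ci_width": 5.0, "point": 30}
full_visit = {"observed_domains": ["motor", "cognition", "autonomic"],
              "ci_width": 5.0, "point": 30}
```

Returning a reason code rather than silently dropping the forecast is what makes suppression auditable, e.g., for the equity analysis the abstract reports.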

16
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images

Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.

2026-04-24 health informatics 10.64898/2026.04.23.26351616 medRxiv
Top 0.1%
42.8%

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
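
The C-index used throughout this evaluation can be sketched as the fraction of comparable patient pairs ordered correctly by predicted risk; the risk scores and event times below are invented:

```python
# Sketch of the concordance index (C-index): among comparable pairs
# (where one subject's event time is observed and earlier), count how
# often the earlier event has the higher predicted risk. Data invented.
def c_index(risk, time, event):
    """Concordant fraction over comparable pairs; ties count as 0.5."""
    concordant = ties = comparable = 0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable

risk, time_, event = [0.9, 0.6, 0.2], [2, 5, 9], [1, 1, 1]
cidx = c_index(risk, time_, event)
```

A C-index of 0.5 means no discrimination, which is why the EHR-only model's 0.50 indicates essentially no prognostic signal, while 0.59 for EHR+RETFound is a modest but real improvement.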

17
LLM-Driven Target Trial Emulation with Human-in-the-Loop Validation for Randomized Trial: Automated Protocol Extraction and Real-World Outcome Evaluation

Dey, S. K.; Qureshi, A. I.; Shyu, C.-R.

2026-04-13 health informatics 10.64898/2026.04.09.26350523 medRxiv
Top 0.1%
42.3%
Show abstract

Target trial emulation (TTE) enables causal inference from observational data but remains bottlenecked by manual, expert-dependent protocol operationalization. While large language models (LLMs) have advanced clinical knowledge extraction and code generation, their ability to automate end-to-end TTE workflows remains largely unexplored. We present an LLM-driven framework using retrieval-augmented generation to extract the five core TTE design parameters from the Carotid Revascularization and Medical Management for Asymptomatic Carotid Stenosis Trial (CREST-2) protocol and generate executable phenotyping pipelines for real-world EHR data. The performance of the framework was evaluated along two dimensions. First, protocol extraction accuracy was assessed against a gold-standard checklist of trial design components using precision, recall, and F1-score metrics. Second, outcome validity was evaluated through population-level concordance analyses comparing EHR-derived outcomes with published trial endpoints using standardized mean difference, observed-to-expected ratios, confidence interval overlap, and two-proportion z-tests. Further, human-in-the-loop validation assessed the correctness of extracted clinical logic and phenotype definitions. Together, these evaluations demonstrate a structured approach for assessing LLM-driven protocol-to-pipeline translation for scalable real-world evidence generation.
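Scoring extracted protocol components against a gold-standard checklist reduces to set overlap. A short sketch with hypothetical component labels (the function and the label names are illustrative; the actual CREST-2 checklist items are not given in the abstract):

```python
def extraction_scores(extracted, gold):
    """Precision, recall, and F1 for extracted protocol components
    compared against a gold-standard checklist (sets of labels)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # correctly extracted components
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical checklist for the five core TTE design parameters
gold = {"eligibility", "treatment_strategies", "assignment",
        "outcome", "follow_up"}
extracted = {"eligibility", "treatment_strategies", "outcome",
             "washout"}  # one spurious extraction, two components missed
p, r, f1 = extraction_scores(extracted, gold)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.75 0.6 0.667
```

In practice each checklist item would also need a semantic match criterion (e.g. clinician adjudication of whether the extracted eligibility logic is equivalent to the protocol's), which is what the human-in-the-loop step supplies.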

18
Representation Before Retrieval: Structured Patient Artifacts Reduce Hallucination in Clinical AI Systems

Scanlin, J.; Cuesta, A.; Varsavsky, M.

2026-02-16 health informatics 10.64898/2026.02.13.26346256 medRxiv
Top 0.1%
41.5%
Show abstract

Background: Large language models show promise for clinical decision support, yet their propensity for hallucination--generating plausible but unsupported claims--poses substantial patient safety risks. Retrieval-augmented generation (RAG) is widely assumed to mitigate this problem by grounding outputs in retrieved documents, but this assumption remains inadequately tested in clinical contexts where information density, temporal complexity, and safety stakes are uniquely high. Methods: We developed a system that compiles heterogeneous patient data (electronic health records, wearables, genomics, imaging reports) into structured, machine-readable artifacts with explicit provenance tracking across seven clinical domains. We evaluated four conditions: baseline LLM (C0), RAG over raw clinical text (C1), artifact-augmented single-pass generation (C2), and artifact-augmented multi-step agent workflow with verification (C3). Using 100 synthetic patient vignettes evaluated across 3 random seeds (N = 300 per condition, 1,200 total), we measured unsupported claim rates, factual accuracy, temporal consistency, contraindication detection, and clinical safety metrics using GPT-4o-mini with physician-adjudicated safety review. Results: RAG substantially increased hallucination: unsupported claim rates rose from 5.0% (95% CI: 3.8-6.4%) at baseline to 43.6% (95% CI: 40.1-47.2%) with retrieval--an 8.7-fold increase (p < 0.001, Cohen's d = 2.31). Structured artifacts reduced unsupported claims to 8.4% (95% CI: 6.7-10.3%) in single-pass generation, a 40% relative reduction versus baseline (p = 0.02, d = 0.48). The agent workflow achieved 21.1% unsupported claims with the lowest contraindication miss rate (0.04) and highest clinician utility scores. Ablation analysis revealed that citation requirements and constraint checking contributed most to safety improvements. Conclusions: Contrary to prevailing assumptions, RAG increases rather than decreases hallucination in clinical text generation.
Structured representation with explicit provenance offers a more effective approach to grounding LLM outputs in verifiable patient data. We propose an information-theoretic framework explaining why representation quality determines the ceiling on factual reliability, while agentic verification affects uncertainty handling and safety constraint enforcement.
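The headline numbers here are proportions (unsupported claims out of N generations) with 95% confidence intervals. The study estimates its intervals another way (across random seeds), but a standard closed-form alternative for a single proportion is the Wilson score interval; a minimal sketch with the baseline condition's counts as illustrative input:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a proportion, e.g. the rate of
    unsupported claims among n generated clinical summaries."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# e.g. 15 unsupported claims out of 300 generations (5.0%)
lo, hi = wilson_ci(15, 300)
print(f"5.0% (95% CI: {lo:.1%}-{hi:.1%})")
```

The Wilson interval behaves well for small proportions near zero (it never crosses 0%), which matters when comparing conditions like the 5.0% baseline against the 43.6% RAG condition. It will not exactly reproduce the study's seed-level intervals, since those also capture run-to-run variance.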

19
The Clinician Model Card: development and evaluation of clinician-centered documentation for AI-based clinical decision support

Agha-Mir-Salim, L.; Frey, N.; Kaiser, Z.; Mosch, L.; Weicken, E.; Freyer, O.; Ma, J.; Mittermaier, M.; Meyer, A.; Gilbert, S.; Muller-Birn, C.; Balzer, F.

2026-04-17 health informatics 10.64898/2026.04.15.26350930 medRxiv
Top 0.1%
41.5%
Show abstract

AI documentation frameworks remain poorly designed for point-of-care use, leaving clinicians without actionable information on how to use clinical AI models when they need it most. We developed the Clinician Model Card, an interactive, clinician-centered documentation tool, and evaluated it in a sequential exploratory mixed-methods study: interviews with 12 physicians informed iterative co-design, followed by evaluation in a national survey of 129 physicians across Germany. The tool was well-received: 84% agreed it should be routinely available, and 66% considered its content relevant to clinical decision-making. Yet comprehensibility of statistical performance metrics remained poor despite targeted interventions: only 32% understood the Validation & Performance section well, and fewer than 54% correctly interpreted AUROC or PPV, with AI literacy as a strong predictor of comprehension. We propose empirically derived design principles for clinician-centered AI documentation. Effective AI transparency requires not only clinician-friendly design and workflow integration, but also sustained investment in AI literacy.
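One reason PPV is hard to interpret correctly, as the survey found, is that it depends on disease prevalence and not just on the model: the same sensitivity and specificity yield very different PPVs in different populations. A short illustration via Bayes' rule (the numbers are illustrative, not from the study):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics and
    disease prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence            # P(test+, disease+)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(test+, disease-)
    return true_pos / (true_pos + false_pos)

# The same model looks very different at different prevalences --
# a common source of misinterpretation in model documentation.
print(round(ppv(0.90, 0.90, 0.50), 3))  # 0.9
print(round(ppv(0.90, 0.90, 0.01), 3))  # 0.083
```

A model card that reports only sensitivity/specificity leaves this prevalence-dependent step to the reader, which is exactly the kind of interpretive burden clinician-centered documentation tries to remove.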

20
Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients

Shibakov, D.

2026-02-17 health informatics 10.64898/2026.02.13.26346284 medRxiv
Top 0.1%
41.4%
Show abstract

Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers. Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular disease risk, chronic kidney disease risk, systemic inflammation, nutrient deficiency, liver risk, and anemia) from laboratory biomarkers. We evaluated five LLMs from four providers--Grok-3 (xAI), GPT-4o and GPT-4o-mini (OpenAI), Claude Haiku 4.5 (Anthropic), and Gemini 2.0 Flash (Google)--using identical system prompts and inputs on 4,018 adults from the CDC NHANES 2017-2018. Ground truth was established using published clinical criteria (ADA, AHA, KDIGO, WHO). Performance was measured by F1 score with 95% confidence intervals, sensitivity, specificity, and positive predictive value. Results: All five models achieved clinical-grade performance (F1 > 0.86) on eight evaluable patterns. Mean F1 scores ranged from 0.865 (95% CI: 0.799-0.931) for GPT-4o-mini to 0.963 (95% CI: 0.930-0.996) for Grok-3. Flagship models significantly outperformed economy-tier models (mean F1: 0.940 vs 0.881; paired t-test p=0.004). Grok-3 achieved near-perfect scores on liver risk (F1=1.000), anemia (0.999), and nutrient deficiency (0.997). Cardiovascular disease risk was the most challenging pattern (F1 range: 0.853-0.885). JSON parse rates exceeded 99.9% for all models. Total benchmark cost was approximately $59 USD. Conclusions: A standardized prompt-based framework achieves clinical-grade accuracy across five LLMs from four independent providers, demonstrating model-agnostic generalizability.
These findings support the feasibility of vendor-independent clinical AI systems that can leverage multiple models without requiring framework revalidation.
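The flagship-vs-economy comparison is a paired test over the eight clinical patterns: each pattern yields one F1 difference, and the question is whether those differences are consistently positive. The study reports a paired t-test; with only eight pairs, an exact sign-flip permutation test is a standard distribution-free alternative. A sketch with hypothetical per-pattern differences (the diffs below are invented for illustration, not the study's data):

```python
from itertools import product
from statistics import mean

def paired_permutation_p(diffs):
    """Exact two-sided sign-flip permutation test for paired data:
    under the null, each per-pattern difference is equally likely to
    be positive or negative, so enumerate all 2^n sign assignments
    and count those with |mean| at least as extreme as observed."""
    observed = abs(mean(diffs))
    extreme = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(mean(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total

# Hypothetical per-pattern F1 differences (flagship minus economy)
diffs = [0.06, 0.05, 0.07, 0.04, 0.08, 0.05, 0.06, 0.06]
print(paired_permutation_p(diffs))  # 0.0078125 (= 2/256)
```

With eight pairs there are only 256 sign assignments, so the smallest achievable two-sided p-value is 2/256 ≈ 0.008, reached when every pattern favors the same tier; this is in the same neighborhood as the reported p=0.004 and avoids the normality assumption of the t-test.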